
Optimize CPU cache efficiency on dict while it's being rehashed #5692

Merged · 7 commits merged into redis:unstable on Oct 31, 2023

Conversation

@erpeng (Contributor) commented Dec 13, 2018

When looking up a key while Redis is rehashing, we currently have to search both tables (ht[0] and ht[1]). But if we compare the key's index against rehashidx and find index < rehashidx, we can conclude:

  1. the dict is rehashing (rehashidx is -1 when it is not), and
  2. the key cannot be in ht[0], so we can skip it and go straight to ht[1].

The possible performance gain here is not avoiding the loop over the linked list (which is empty), but rather avoiding the lookup in the table (which could be a cache miss).

benchmark: #5692 (comment)
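For context, this is roughly where the guard lands in dictFind() (a simplified sketch of the dict.c of that era, with the one added line marked; not a verbatim copy of the patch):

    dictEntry *dictFind(dict *d, const void *key)
    {
        dictEntry *he;
        uint64_t h, idx, table;

        if (dictSize(d) == 0) return NULL;
        if (dictIsRehashing(d)) _dictRehashStep(d);
        h = dictHashKey(d, key);
        for (table = 0; table <= 1; table++) {
            idx = h & d->ht[table].sizemask;
            /* The guard this PR adds: ht[0] buckets below rehashidx have
             * already been migrated to ht[1], so skip the table load. */
            if (table == 0 && (long)idx < d->rehashidx) continue;
            he = d->ht[table].table[idx];
            while (he) {
                if (key == he->key || dictCompareKeys(d, key, he->key))
                    return he;
                he = he->next;
            }
            if (!dictIsRehashing(d)) return NULL;
        }
        return NULL;
    }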

@wenbochang888 commented:

I also have the same doubts. Do you have any new thoughts? Can you explain it to me? Thanks.

@erpeng (Contributor, author) commented Oct 14, 2019

> I also have the same doubts. Do you have any new thoughts? Can you explain it to me? Thanks.

We don't need to check ht[0] if the index is less than rehashidx, because all elements below rehashidx have already been moved to ht[1].

@wenbochang888 commented:

Perfect! I agree with you.
Actually, even without the line if (table == 0 && idx < d->rehashidx) continue;,
he = d->ht[table].table[idx]; would be NULL, so we would still move on to ht[1].
But your code is more efficient.

What does the Redis author think? Why hasn't he merged this request?

@erpeng (Contributor, author) commented Oct 15, 2019

> Perfect! I agree with you. […] What does the Redis author think? Why hasn't he merged this request?

I don't know. Maybe it's not urgent.

@yoav-steinberg (Contributor) commented:

@erpeng Thanks. This optimization is nice, but I wonder how much there is to gain here. Immediately after computing the index we look it up in ht[0], and in our case (if we're rehashing and idx is below rehashidx) we'll always get NULL for he, then avoid the internal while loop and effectively continue. So your optimization only avoids the single extra lookup into the hash table.

I suggest closing this, but let me know if you think otherwise.

@yoav-steinberg added the state:to-be-closed label, Dec 6, 2021
@erpeng (Contributor, author) commented Dec 7, 2021

> @erpeng Thanks. This optimization is nice, but I wonder how much there is to gain here. […] I suggest closing this, but let me know if you think otherwise.

I think it's not only about performance. With this change it's simpler and more obvious that we're accounting for the rehash, whereas relying on he being NULL is hidden logic.

@ohye3166 (Contributor) commented Dec 7, 2021

@erpeng [translated from Chinese] There's one case: when dictAddRaw hits the rehash condition, rehashidx is set to 0, and the key being added at that moment goes into table[1]. If someone then looks up that key, they find rehashidx is 0, which is less than idx, so they search table[0], where the key can't be found.

@sundb (Collaborator) commented Dec 7, 2021

@ohye3166 You'd better use English to express your question.
If rehashidx is 0, then we will look in table[0], and it will not be found, but in the next loop we will look in table[1], and it will be found.
Not sure I fully understand you.

@ohye3166 (Contributor) commented Dec 7, 2021

> @ohye3166 You'd better use English to express your question. If rehashidx is 0, then we will look in table[0], and it will not be found, but in the next loop we will look in table[1], and it will be found. […]

Yes, you are right; my mistake.

@oranagra added the state:to-be-merged label and removed the state:to-be-closed label, Dec 8, 2021
@oranagra (Member) commented Dec 8, 2021

As discussed in #9898, the possible performance gain here (which might not be measurable) is not the looping over the linked list (which is empty), but rather the lookup in the table (which could be a cache miss).
@erpeng can you refresh this PR?

zhangshihua003 and others added 3 commits December 8, 2021 17:59
@erpeng (Contributor, author) commented Dec 8, 2021

> As discussed in #9898, the possible performance gain here […] @erpeng can you refresh this PR?

I have refreshed this PR. Can you explain the cache-miss situation? I can't quite understand it.

@sundb (Collaborator) commented Dec 8, 2021

Results of testing with redis-server test dict --accurate.
It's odd that the performance has gone down.

unstable

Inserting: 5000000 items in 3499 ms
Linear access of existing elements: 5000000 items in 1378 ms
Linear access of existing elements (2nd round): 5000000 items in 1397 ms
Random access of existing elements: 5000000 items in 2176 ms
Accessing random keys: 5000000 items in 1571 ms
Accessing missing: 5000000 items in 1659 ms
Removing and adding: 5000000 items in 2994 ms

this pr

Inserting: 5000000 items in 3545 ms
Linear access of existing elements: 5000000 items in 1606 ms
Linear access of existing elements (2nd round): 5000000 items in 1528 ms
Random access of existing elements: 5000000 items in 2794 ms
Accessing random keys: 5000000 items in 1010 ms
Accessing missing: 5000000 items in 1847 ms
Removing and adding: 5000000 items in 3189 ms

@erpeng (Contributor, author) commented Dec 9, 2021

> Results of testing with redis-server test dict --accurate. It's odd that the performance has gone down. […]

How to run this test?

@sundb (Collaborator) commented Dec 9, 2021

@erpeng make REDIS_CFLAGS='-DREDIS_TEST' && ./src/redis-server test dict --accurate

@madolson (Contributor) commented:

@sundb I ran the benchmark myself and got different results. Here are my numbers, running with 50 million requests instead of the 5 million from --accurate, on an ARM instance (maybe the difference is ARM?).

Current code

Inserting: 50000000 items in 51409 ms
Linear access of existing elements: 50000000 items in 24376 ms
Linear access of existing elements (2nd round): 50000000 items in 25144 ms
Random access of existing elements: 50000000 items in 47692 ms
Accessing random keys: 50000000 items in 10201 ms
Accessing missing: 50000000 items in 30661 ms
Removing and adding: 50000000 items in 58017 ms

With this PR

Inserting: 50000000 items in 51375 ms
Linear access of existing elements: 50000000 items in 24618 ms
Linear access of existing elements (2nd round): 50000000 items in 25891 ms
Random access of existing elements: 50000000 items in 48489 ms
Accessing random keys: 50000000 items in 10378 ms
Accessing missing: 50000000 items in 31061 ms
Removing and adding: 50000000 items in 57673 ms

My concern with running with 5 million is that too much of the data might be in CPU cache, and so it might be unrepresentative of real workloads.

@sundb (Collaborator) commented Dec 12, 2021

@madolson I ran the dict unit test again, this time on macOS. The results are very similar to yours, but the performance gap in the Ubuntu VM is really big, especially for the Accessing random keys test.

Unstable:

Inserting: 50000000 items in 23804 ms
Linear access of existing elements: 50000000 items in 14697 ms
Linear access of existing elements (2nd round): 50000000 items in 14641 ms
Random access of existing elements: 50000000 items in 25952 ms
Accessing random keys: 50000000 items in 4269 ms
Accessing missing: 50000000 items in 18416 ms
Removing and adding: 50000000 items in 33679 ms

With this PR

Inserting: 50000000 items in 23702 ms
Linear access of existing elements: 50000000 items in 14648 ms
Linear access of existing elements (2nd round): 50000000 items in 14609 ms
Random access of existing elements: 50000000 items in 25815 ms
Accessing random keys: 50000000 items in 4257 ms
Accessing missing: 50000000 items in 18412 ms
Removing and adding: 50000000 items in 33966 ms

@oranagra (Member) commented:

> I have refreshed this PR. Can you explain the cache-miss situation? I can't quite understand it.

@erpeng Following a pointer to look up its value, or accessing an array at an offset, usually touches memory that wasn't recently accessed and is not inside the CPU cache, which results in a real memory access (much slower than running a few instructions on stack-based variables that are likely to be cached).
So when we know table[idx] is NULL, the main gain of skipping the check is not that we skip the while (he) part, but rather the access to the table.

P.S. I don't mind merging this even if we can't measure the difference (since the change is just adding one line, not refactoring anything), but what's odd is that some results show degradation, which doesn't make sense.
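In code terms (an annotated illustration of the two accesses, not a quote from the patch):

    /* Cheap: d->rehashidx is a hot field of the dict struct itself and is
     * almost certainly already in the CPU cache. */
    if (table == 0 && (long)idx < d->rehashidx) continue;

    /* Potentially expensive: table[idx] is a load from a huge, sparsely
     * accessed bucket array, i.e. a likely cache miss, even though for an
     * already-rehashed bucket the loaded value is known to be NULL. */
    he = d->ht[table].table[idx];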

@madolson (Contributor) commented:

@sundb I got similar values again on macOS, but I generally don't trust Macs here, since macOS can be weird with libmalloc, and since it's a laptop, its cores run at a variable clock speed.

As far as I can tell, there should be no performance change when doing the random key removal, since nothing around that code path has changed. It could be related to this: https://www.youtube.com/watch?v=r-TLSBdHe1A&t=464s. If we are somehow messing with the branch predictor (which is probably important here), we might be getting a lot of CPU stalls. We probably need better tooling for determining whether something actually improves performance.

@yoav-steinberg (Contributor) commented Dec 13, 2021

For anyone running the dict benchmarks: in the context of this PR, I think we should be focusing on performance during rehashing. The benchmarks currently wait for rehashing to complete, but this optimization should only be significant while it's running. It might be interesting to add a rehashing benchmark that somehow does the same tests while rehashing is in progress.

@madolson (Contributor) commented:

I'm going to propose we close this and #9898; we don't really have clear evidence of a performance gain. I think we want to take a more holistic look at the dictionaries anyway to improve memory and/or performance. @oranagra @sundb let me know if you disagree.

@madolson removed the state:to-be-merged label, Jul 12, 2022
@oranagra (Member) commented:

I'll go ahead and close #9898 since we concluded this one is better, but I want to state again that the fact we weren't able to measure any performance gain was possibly because our benchmark didn't really hit the scenario we optimized,
i.e. benchmarking the dict while it's rehashing.

I think the fact that we have a long-term goal to completely replace the dict shouldn't conflict with small incremental improvements.

@oranagra changed the title from "add guard clause before lookup ht[0]" to "Optimize CPU cache efficiency on dict while it's being rehashed", Jul 12, 2022
@oranagra (Member) commented Oct 5, 2023

@judeng following up on the discussion from #11494, I see that the main reason we didn't merge this one yet is that we didn't prove it improves performance; but AFAICT those benchmarks weren't run during rehashing, so maybe you can afford to give it a try?

@judeng (Contributor) commented Oct 12, 2023

@oranagra thank you for your hint, I'm happy to try it

@judeng (Contributor) commented Oct 26, 2023

Now I've tested this PR. The results are similar to @madolson's and @sundb's: no improvement or deterioration in performance could be observed. But through the perf tool, I observed that this PR does less DRAM access.

My test environment was a bare-metal machine with 96 cores and 512 GB of DRAM; the CPU model is Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz, and the compiler is gcc 7.5.0.

I devised two different benchmarks, and the results are consistent.

  1. Use memtier_benchmark to insert 2^25 keys, then run a benchmark including get/set, just like [NEW] Improve rehash performance for db and expire dict #11494.
  2. Add a new rehashing test to the dict benchmarks: look up missing keys while rehashing (see the code below).
    int bitsize = 0;
    while ((count = count / 2) != 0) {
        bitsize++;
    }
    dictEmpty(dict, NULL);
    count = 1L << bitsize; /* round count down to a power of two */
    /* Inserting 2^N+1 keys triggers the rehash, ensuring that table[1] exists. */
    for (j = 0; j < count + 1; j++) {
        int ret = dictAdd(dict, stringFromLongLong(j), (void *)j);
        assert(ret == DICT_OK);
    }

    char key[21];
    start_benchmark();
    for (j = 0; j < count; j++) {
        /* Using ll2string (instead of allocating key strings) reduces memory
         * operations, so rehash performance changes can be observed more
         * accurately. */
        ll2string(key, 21, j);
        key[0] = 'X'; /* mangle the key so the lookup misses */
        dictEntry *de = dictFind(dict, key);
        assert(de == NULL);
    }
    end_benchmark("Accessing missing when rehashing");

I also compared the assembly code and found that gcc 7.5's -O2 optimization does not interfere with the function once the new branch check is added.
gcc -O2 -S -fverbose-asm dict.c -o dict.s

I then looked at the cache misses and accesses using perf, and it was clear that this PR does fewer fetches from DRAM and takes fewer misses.

$ sudo perf stat -e cache-misses -e cache-references ./redis-server-unstable test dict --accurate

 Performance counter stats for './redis-server-unstable test dict --accurate':

       196,515,412      cache-misses              #   49.121 % of all cache refs
       400,063,631      cache-references

$ sudo perf stat -e cache-misses -e cache-references ./redis-server-rehashidx test dict --accurate

 Performance counter stats for './redis-server-rehashidx test dict --accurate':

       187,646,220      cache-misses              #   47.933 % of all cache refs
       391,479,924      cache-references

My guess is that this is due to instruction-level parallelism and DRAM prefetching in modern CPUs, where table[0] is always fetched concurrently with table[1], so reducing table[0] accesses won't improve performance.
However, I would still recommend merging this PR, because in real life there are various scenarios where a machine is under heavy load, such as memory bandwidth hitting a bottleneck or accessing remote NUMA memory, and reducing DRAM access is always beneficial.
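If someone wants to probe the memory-level-parallelism guess further, counting memory-stall cycles is one option (a suggestion only; cycle_activity.stalls_mem_any is an Intel-specific event name and may differ or be unavailable on other CPUs):

    sudo perf stat -e cycles,instructions,cycle_activity.stalls_mem_any \
        ./redis-server test dict --accurate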

@oranagra (Member) commented:

Thanks @judeng, I agree we should still merge it.
At least now we have some indication of a positive impact.
A few quick questions:

  1. In the benchmark code you wrote, could it be that the dict wasn't doing the rehashing it was meant to do, because it was still rehashing from an earlier, smaller growth? Maybe it's better to call dictExpand(count).
  2. Could it be that the benchmark loop iterated too many times and kept going after the rehash completed, averaging down the results?
  3. When you ran perf on test dict --accurate, did you comment out the other tests in the suite? Otherwise they average down the relative results.

@judeng (Contributor) commented Oct 27, 2023

@oranagra I get your concerns and have updated the code; I've also commented out the other test cases in dict.c. The results are exactly the same as before. Please feel free to add your thoughts.

  • New benchmark:

    int bitsize = 0;
    while ((count = count / 2) != 0) {
        bitsize++;
    }
    dictEmpty(dict, NULL);
    count = 1L << bitsize;
    for (j = 0; j < count; j++) {
        int ret = dictAdd(dict, stringFromLongLong(j), (void *)j);
        assert(ret == DICT_OK);
    }
    dictExpand(dict, count*2);
    /* Make rehashidx large enough that most keys have idx smaller than rehashidx. */
    while (1) {
        dictRehash(dict, 100);
        if (dict->rehashidx > (count-1000)) break;
    }
    printf("dict->rehashidx:%ld\n", dict->rehashidx);

    dictPauseRehashing(dict);
    char key[21];
    start_benchmark();
    for (j = 0; j < count; j++) {
        ll2string(key, 21, j);
        key[0] = 'X';
        dictEntry *de = dictFind(dict, key);
        assert(de == NULL);
    }
    end_benchmark("Accessing missing when rehashing");
  • perf result:
[judeng@fd-test-codis-dash02 src]$ sudo perf stat -e cache-misses -e cache-references ./redis-server test dict --accurate 
dict->rehashidx:4193368
Accessing missing when rehashing: 4194304 items in 990 ms

 Performance counter stats for './redis-server test dict --accurate':

        37,598,655      cache-misses              #   42.897 % of all cache refs    
        87,647,805      cache-references                                            

       5.781412267 seconds time elapsed

[judeng@fd-test-codis-dash02 src]$ sudo perf stat -e cache-misses -e cache-references ./redis-server-rehash test dict --accurate 
dict->rehashidx:4193368
Accessing missing when rehashing: 4194304 items in 912 ms

 Performance counter stats for './redis-server-rehash test dict --accurate':

        36,063,218      cache-misses              #   44.266 % of all cache refs    
        81,470,018      cache-references                                            

       5.719584564 seconds time elapsed

@oranagra (Member) commented:

Aren't we missing a dictExpand(dict, count); at the beginning (after dictEmpty)?

@judeng (Contributor) commented Oct 27, 2023

@oranagra Maybe I missed something.
The count has been updated to 4194304 (2^22) by the line count = 1L << bitsize, so the rehashing from the earlier, smaller growth has finished before the new, bigger rehash is triggered. This has the same effect as dictExpand.

Is the following code what you want? It avoids the overhead of inserting the keys.

    dictEmpty(dict, NULL);
    dictExpand(dict, count);
    while (dictIsRehashing(dict)) {
        dictRehash(dict, 100);
    }
    dictExpand(dict, count*2);
    /* Make rehashidx large enough that most keys have idx smaller than rehashidx. */
    while (1) {
        dictRehash(dict, 100);
        if (dict->rehashidx > DICTHT_SIZE(dict->ht_size_exp[0])-2000) break;
    }
    printf("dict->rehashidx:%ld\n", dict->rehashidx);

    dictPauseRehashing(dict);
    char key[21];
    start_benchmark();
    for (j = 0; j < count; j++) {
        ll2string(key, 21, j);
        key[0] = 'X';
        dictEntry *de = dictFind(dict, key);
        assert(de == NULL);
    }
    end_benchmark("Accessing missing when rehashing");

@oranagra (Member) commented:

In your above example the dict is empty.
I meant that we should expand the dict before populating it, so that the population of 4M entries goes into a dict that already has 4M buckets.
Then we resize it to 8M buckets, loop until the rehash index is high, and then benchmark dictFind.

@judeng (Contributor) commented Oct 30, 2023

@oranagra Yes, you are right, we do need to fill in some keys. I wanted to minimize unnecessary memory operations in this test, to see the impact of this PR on the CPU cache, so I only filled keys into 1/100 of the dict's buckets. The results are even clearer, with DRAM accesses nearly halved (222M vs 128M cache references) and time consumption reduced by 5%.

    dictEmpty(dict, NULL);
    dictExpand(dict, count);
    while (dictIsRehashing(dict)) {
        dictRehash(dict, 100);
    }
    for (j = 0; j < count / 100; j++) {
        int ret = dictAdd(dict, stringFromLongLong(j), (void *)j);
        assert(ret == DICT_OK);
    }
    dictExpand(dict, count*2);
    printf("dict->rehashidx:%ld, dict size:%lu\n", dict->rehashidx, DICTHT_SIZE(dict->ht_size_exp[0]));
    /* Make rehashidx large enough that most keys have idx smaller than rehashidx. */
    while (1) {
        dictRehash(dict, 1);
        if (dict->ht_used[0] < 10)
            break;
    }
    printf("dict->rehashidx:%ld\n", dict->rehashidx);

    dictPauseRehashing(dict);
    char key[21];
    start_benchmark();
    for (j = 0; j < count; j++) {
        ll2string(key, 21, j);
        key[0] = 'X';
        dictEntry *de = dictFind(dict, key);
        assert(de == NULL);
    }
    end_benchmark("Accessing missing when rehashing");

results:

[judeng@fd-test-codis-dash02 src]$ sudo perf stat -e cache-misses -e cache-references ./redis-server-unstable test dict 50000000
dict->rehashidx:0, dict size:67108864
dict->rehashidx:67107237
Accessing missing when rehashing: 50000000 items in 9167 ms

 Performance counter stats for './redis-server-unstable test dict 50000000':

       126,969,332      cache-misses              #   56.962 % of all cache refs    
       222,900,692      cache-references                                            

      11.311874814 seconds time elapsed
[judeng@fd-test-codis-dash02 src]$ sudo perf stat -e cache-misses -e cache-references ./redis-server-rehash test dict 50000000
dict->rehashidx:0, dict size:67108864
dict->rehashidx:67107237
Accessing missing when rehashing: 50000000 items in 8711 ms

 Performance counter stats for './redis-server-rehash test dict 50000000':

        80,324,778      cache-misses              #   62.366 % of all cache refs    
       128,795,881      cache-references                                            

      10.919625194 seconds time elapsed

@oranagra (Member) commented:

great.. that's conclusive proof that it's an improvement!

@oranagra merged commit 4bbb2b0 into redis:unstable on Oct 31, 2023. 13 checks passed.
lyq2333 added a commit to lyq2333/redis referencing this pull request, Nov 29, 2023.
soloestoy pushed a commit referencing this pull request, Nov 30, 2023:

…UnlinkFind (#12818)

In #5692, we optimized CPU cache efficiency on the dict while rehashing, but missed the corresponding modification in dictTwoPhaseUnlinkFind.

This PR supplements it.
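For reference, the analogous guard in dictTwoPhaseUnlinkFind looks roughly like this (a sketch against the post-7.0 dict layout, where the two tables live in d->ht_table[2]; helper names such as dictGetNextRef follow that era's dict.c and may differ):

    for (table = 0; table <= 1; table++) {
        idx = h & DICTHT_SIZE_MASK(d->ht_size_exp[table]);
        /* Same one-line guard as dictFind: skip ht[0] buckets already
         * migrated to ht[1]. */
        if (table == 0 && (long)idx < d->rehashidx) continue;
        dictEntry **ref = &d->ht_table[table][idx];
        while (ref && *ref) {
            void *de_key = dictGetKey(*ref);
            if (key == de_key || dictCompareKeys(d, key, de_key)) {
                *table_index = table;
                *plink = ref;
                dictPauseRehashing(d);
                return *ref;
            }
            ref = dictGetNextRef(*ref);
        }
        if (!dictIsRehashing(d)) return NULL;
    }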